A novel contextual topic model for multi-document summarization

Authors

  • Guangbing Yang
  • Dunwei Wen
  • Kinshuk
  • Nian-Shing Chen
  • Erkki Sutinen
Abstract

Information overload has become a serious problem in the digital age. It negatively impacts the understanding of useful information. How to alleviate this problem is a main concern of research on natural language processing, especially multi-document summarization. With the aim of finding a new method to help judge the importance of similar sentences in multi-document summarization, this study proposes a novel approach based on recent hierarchical Bayesian topic models. The proposed model incorporates the concept of n-grams into hierarchical latent topics to capture the word dependencies that appear in the local context of a word. Quantitative and qualitative evaluation results show that this model outperforms both hLDA and LDA in document modeling. In addition, experimental results demonstrate that a summarization system implementing this model significantly improves performance and is comparable to state-of-the-art summarization systems.

While the rapid growth of the World Wide Web has brought people from different parts of the world much closer together and given them access to a vast amount of information at their fingertips, it has also created a serious problem of information overload, which hinders people in comprehending useful information. How to alleviate this problem is of concern to research on automatic text summarization, especially extraction-based multi-document summarization. The main task of extraction-based multi-document summarization is to extract the most important sentences from multiple documents and format them into a summary. Therefore, finding an appropriate method to judge the importance (or relevance) of a string of text (e.g., a sentence) dominates this research area. Many proposed approaches use statistical methods, lexical chains, graph-based algorithms, or Bayesian language models to produce summaries.
For example, SumBasic, a well-known statistical summarizer, specifies the importance of a sentence in a document by counting term frequency or term frequency-inverse document frequency (TF-IDF), exclusive of stop words (… & Nenkova, 2007). Others identify the relevance of a sentence by using bigram pseudo-sentences for hybrid statistical sentence extraction (Ko & Seo, 2008), rhetoric-based multi-document summarization (Atkinson & Munoz, 2013), and a semantic document concept technique (Ye, Chua, Kan, & Qiu, 2007) based on rhetorical structure theory (RST) (Mann & Thompson, 1988) for analyzing grammatical structures in discourse. However, heavy reliance on human experts' rhetorical roles and linguistic knowledge bases is a bottleneck for RST-based approaches.
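
To make the frequency-based idea concrete, the following is a minimal sketch of a SumBasic-style extractive scorer. It is a simplified illustration, not the implementation evaluated in the paper: the stop-word list, the tokenizer, and the function names `tokenize` and `summarize` are assumptions introduced here. Each sentence is scored by the average probability of its non-stop-word tokens, and after a sentence is chosen the probabilities of its words are squared so that later picks favor content not yet covered.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to", "it"}


def tokenize(sentence):
    """Lowercase a sentence and return its non-stop-word tokens."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def summarize(sentences, max_sentences=2):
    """Greedily select sentences by average word probability (simplified sketch)."""
    tokens = [tokenize(s) for s in sentences]
    counts = Counter(w for toks in tokens for w in toks)
    total = sum(counts.values())
    # Unigram probability of each word over the whole input.
    prob = {w: c / total for w, c in counts.items()}

    chosen = []
    # Sentences with no content words cannot be scored, so skip them.
    candidates = [i for i, toks in enumerate(tokens) if toks]
    while candidates and len(chosen) < max_sentences:
        best = max(
            candidates,
            key=lambda i: sum(prob[w] for w in tokens[i]) / len(tokens[i]),
        )
        chosen.append(best)
        candidates.remove(best)
        # Down-weight words already covered to discourage redundant picks.
        for w in set(tokens[best]):
            prob[w] **= 2

    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarize(sentences, max_sentences=k)` on a list of sentence strings returns up to `k` of them in document order. Squaring the probabilities is one simple way to penalize redundancy; the topic-model approach described in the abstract instead relies on latent topics and n-gram context to judge sentence importance.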

Similar resources

A Novel Feature-based Bayesian Model for Query Focused Multi-document Summarization

Supervised learning methods and LDA-based topic models have been successfully applied in the field of multi-document summarization. In this paper, we propose a novel supervised approach that can incorporate rich sentence features into Bayesian topic models in a principled way, thus taking advantage of both topic models and feature-based supervised learning methods. Experimental results on DUC200...

Multi-Document Summarization using Sentence-based Topic Models

Most of the existing multi-document summarization methods decompose the documents into sentences and work directly in the sentence space using a term-sentence matrix. However, the knowledge on the document side, i.e. the topics embedded in the documents, can help the context understanding and guide the sentence selection in the summarization procedure. In this paper, we propose a new Bayesian s...

A Hybrid Topic Model for Multi-Document Summarization

Topic features are useful in improving text summarization. However, independency among topics is a strong restriction on most topic models, and alleviating this restriction can deeply capture text structure. This paper proposes a hybrid topic model to generate multi-document summaries using a combination of the Hidden Topic Markov Model (HTMM), the surface texture model and the topic transition...

An Aspect-Driven Random Walk Model for Topic-Focused Multi-document Summarization

Recently, there has been increased interest in topic-focused multi-document summarization where the task is to produce automatic summaries in response to a given topic or specific information requested by the user. In this paper, we incorporate a deeper semantic analysis of the source documents to select important concepts by using a predefined list of important aspects that act as a guide for ...

Comparative Summarization via Latent Dirichlet Allocation

This paper aims to explore the possibility of using Latent Dirichlet Allocation (LDA) for multi-document comparative summarization which detects the main differences in documents. The first two sections of this paper focus on the definition of comparative summarization and a brief explanation of using the LDA topic model in this context. In the last three sections, our novel method for multi-do...

Journal:
  • Expert Syst. Appl.

Volume 42  Issue 

Pages  -

Publication date: 2015